Abstract

We consider the fully overlapping triplets of nucleotide bases and propose a 2D graphical representation of protein sequences consisting of 20 amino acids and a stop code. Based on this 2D graphical representation, we outline a new approach to analyzing the phylogenetic relationships of coronaviruses by constructing a covariance matrix. The evolutionary distances are obtained by measuring the differences among the two-dimensional curves.

© 2006 Elsevier B.V. All rights reserved. 


1. Introduction 

Compilation of DNA primary sequence data continues unabated and tends to overwhelm us with voluminous outputs that increase daily. Comparison of primary sequences of different DNA strands remains one of the important aspects of the analysis of DNA data banks. Mathematical analysis of the large volume of genomic DNA sequence data is one of the challenges for bio-scientists. There are three classes of methods for the analysis of DNA sequences: (i) Alignment [1,2]. (ii) Matrices: (1) matrices in which an individual entry corresponds to an individual pair of bases [3,6,7] and (2) matrices in which entries summarize information on different X-Y pairs of bases [4,5,7]. (iii) Graphical representation: graphical representation of DNA sequences provides a simple way of viewing, sorting and comparing various gene structures. Graphical techniques have emerged as a very powerful tool for the visualization and analysis of long DNA sequences. These techniques provide useful insights into local and global characteristics and the occurrences, variations and repetition of the nucleotides along a sequence which are not as easily obtainable by other methods. In recent years several authors have outlined different graphical representations of DNA sequences based on 2D, 3D or 4D [8-20]. Based on these graphical representations, several authors have outlined approaches to the comparison of DNA sequences [21-25].

All these methods are based on the four-letter alphabet (A, C, G, and T, standing for the nucleotide bases adenine, cytosine, guanine, and thymine, respectively). We instead consider the fully overlapping triplets of nucleotide bases. Considering triplets of nucleotide bases instead of individual nucleotide bases has three advantages: (i) The genetic code consists of triplets (codons) of DNA (or RNA in some viruses) nucleotides. (ii) One can easily find the open reading frame as the longest sequence of triplets that contains no stop codons when read in a single reading frame. (iii) The computation becomes simpler.

In this Letter, we propose a 2D graphical representation of protein sequences consisting of 20 amino acids and a stop code. Based on this 2D graphical representation, we outline a new approach to analyzing the phylogenetic relationships of coronaviruses. The evolutionary distances are obtained by measuring the differences among the two-dimensional curves. Unlike most existing phylogeny construction methods [26-31], the proposed method does not require multiple alignment.

2. 2D graphical representation of protein sequences and properties

As is known, all of the 64 triplets of nucleotide bases correspond to 20 amino acids and a stop code. There are three reading frames, starting at positions 1, 2 and 3, respectively. Using the translate tool, we can obtain three protein sequences consisting of 20 amino acids and a stop code. The 20 amino acids found in proteins can be grouped according to the chemistry of their R groups as in [32]: amino acids A, V, F, P, M, I, L belong to the hydrophobic chemical group; amino acids D, E, K, R belong to the charged chemical group; amino acids S, T, Y, H, C, N, Q, W belong to the polar chemical group; amino acid G belongs to the glycine chemical group. Then, for any DNA sequence, we transform it into three new sequences defined over the alphabet {H, C, P, G}. The rule is as follows:

$$
\phi(g(3i-2, 3i-1, 3i)) =
\begin{cases}
H & \text{if } g(3i-2, 3i-1, 3i) = A, V, F, P, M, I, L \\
C & \text{if } g(3i-2, 3i-1, 3i) = D, E, K, R \\
P & \text{if } g(3i-2, 3i-1, 3i) = S, T, Y, H, C, N, Q, W \\
G & \text{if } g(3i-2, 3i-1, 3i) = G, -
\end{cases}
$$
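
The translation and grouping rule above can be sketched in Python. This is only a minimal illustration, not the translate tool used in the text: it assumes the standard genetic code, writes the stop code as `-` as in the rule, and the names `translate`, `to_groups` and the sample sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; "-" marks a stop codon.
AMINO = "FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Chemical groups from the text: hydrophobic, charged, polar, glycine (+ stop).
GROUPS = {"H": "AVFPMIL", "C": "DEKR", "P": "STYHCNQW", "G": "G-"}
RESIDUE_TO_GROUP = {r: g for g, residues in GROUPS.items() for r in residues}


def translate(dna, frame):
    """Translate dna in reading frame 0, 1 or 2 (positions 1, 2, 3 in the text)."""
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def to_groups(protein):
    """Map each residue to its letter in the alphabet {H, C, P, G}."""
    return "".join(RESIDUE_TO_GROUP[r] for r in protein)


dna = "ATGGCCGATTCTGGTTAA"  # hypothetical toy sequence
frames = [translate(dna, f) for f in range(3)]
print(frames)
print([to_groups(p) for p in frames])
```

Ambiguous bases such as N are not handled here and would raise a `KeyError`; a real pipeline would need a policy for them.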

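As shown in Fig. 1, we construct a pyrimidine-purine graph on two quadrants of the Cartesian coordinate system, with pyrimidines (P and C) in the first quadrant and purines (H and G) in the fourth quadrant. The unit vectors representing the four letters H, G, C and P are as follows:

$$
(m, -\sqrt{n}) \to H, \quad (\sqrt{n}, -m) \to G, \quad (\sqrt{n}, m) \to C, \quad (m, \sqrt{n}) \to P
$$

where $m$ is a real number with $m \neq \sqrt{n}$, and $n$ is a positive real number that is not a perfect square. In this way we reduce a DNA sequence to a series of nodes $P_0, P_1, P_2, \ldots, P_{\lfloor N/3 \rfloor}$, whose coordinates $(x_i, y_i)$ ($i = 0, 1, 2, \ldots, \lfloor N/3 \rfloor$, where $N$ is the length of the DNA sequence being studied) satisfy

$$
\begin{cases}
x_i = h_i m + g_i \sqrt{n} + c_i \sqrt{n} + p_i m \\
y_i = -h_i \sqrt{n} - g_i m + c_i m + p_i \sqrt{n}
\end{cases}
\tag{1}
$$

where $h_i$, $c_i$, $g_i$ and $p_i$ satisfy

$$
\begin{cases}
h_i = A_i + \sqrt{s_1} V_i + \sqrt{s_2} F_i + \sqrt{s_3} P_i + \sqrt{s_4} M_i + \sqrt{s_5} I_i + \sqrt{s_6} L_i \\
c_i = D_i + \sqrt{s_7} E_i + \sqrt{s_8} K_i + \sqrt{s_9} R_i \\
g_i = S_i + \sqrt{s_{10}} T_i + \sqrt{s_{11}} Y_i + \sqrt{s_{12}} H_i + \sqrt{s_{13}} C_i + \sqrt{s_{14}} N_i + \sqrt{s_{15}} Q_i + \sqrt{s_{16}} W_i \\
p_i = G_i + \sqrt{s_{17}} \Omega_i
\end{cases}
\tag{2}
$$

Here $A_i, V_i, F_i, P_i, M_i, I_i, L_i, D_i, E_i, K_i, R_i, S_i, T_i, Y_i, H_i, C_i, N_i, Q_i, W_i, G_i, \Omega_i$ are the cumulative occurrence numbers of A, V, F, P, M, I, L, D, E, K, R, S, T, Y, H, C, N, Q, W, G and - (or stop code), respectively, in the subsequence from the 1st base to the $i$th base in the sequence. In Eq. (2), $g_i$ accumulates the polar residues and $p_i$ accumulates glycine and the stop code. The $s_k$, $k = 1, \ldots, 17$, are positive real numbers that are not perfect squares, with $s_i \neq s_j$ for $i \neq j$, $i, j = 1, \ldots, 17$, and $m \neq \sqrt{s_k}$, $m \neq \sqrt{n}$, $m\sqrt{s_k} \neq \sqrt{n}$, $k = 1, \ldots, 17$. We define $A_0 = V_0 = F_0 = P_0 = M_0 = I_0 = L_0 = D_0 = E_0 = K_0 = R_0 = S_0 = T_0 = Y_0 = H_0 = C_0 = N_0 = Q_0 = W_0 = G_0 = \Omega_0 = 0$.

Fig. 1. Pyrimidine-purine graph.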

We call the corresponding point set the characteristic plot set. The curve connecting all points of the characteristic plot set in turn is called the characteristic curve, which is determined by $m$ and $n$ satisfying the above-mentioned conditions. In Figs. 2-4, we show the curves corresponding to SARS for different parameters $n$ and $m$, where $s_1 = 2/3$; $s_2 = 3/4$; $s_3 = 4/5$; $s_4 = 5/6$; $s_5 = 6/7$; $s_6 = 7/8$; $s_7 = 8/9$; $s_8 = 9/10$; $s_9 = 10/11$; $s_{10} = 11/12$; $s_{11} = 12/13$; $s_{12} = 13/14$; $s_{13} = 14/15$; $s_{14} = 15/16$; $s_{15} = 16/17$; $s_{16} = 17/18$; $s_{17} = 18/19$. Observing Figs. 2-4, we find that the SARS curves are similar despite the different parameters $n$ and $m$.
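
As an illustration of how Eqs. (1) and (2) produce a characteristic curve, the following sketch accumulates the weighted unit vectors residue by residue. It assumes the group assignment of Eq. (2) and Table 2 (polar residues step along $(\sqrt{n}, -m)$, glycine and the stop code along $(m, \sqrt{n})$) and the values $s_k = (k+1)/(k+2)$ listed above. The parameters $m = 1$, $n = 2$ are illustrative only and are not taken from the paper; all function and variable names are hypothetical.

```python
import math

# Residues in the order of Eq. (2): the first residue of each group has weight 1,
# the k-th remaining residue has weight sqrt(s_k) with s_k = (k + 1) / (k + 2).
GROUP_RESIDUES = {"h": "AVFPMIL", "c": "DEKR", "g": "STYHCNQW", "p": "G-"}


def residue_weights():
    weights, k = {}, 0
    for residues in GROUP_RESIDUES.values():
        weights[residues[0]] = 1.0
        for r in residues[1:]:
            k += 1
            weights[r] = math.sqrt((k + 1) / (k + 2))
    return weights


def characteristic_curve(protein, m=1.0, n=2.0):
    """Return the points (x_i, y_i), i = 0..len(protein), of Eq. (1)."""
    rn = math.sqrt(n)
    unit = {"h": (m, -rn), "g": (rn, -m), "c": (rn, m), "p": (m, rn)}
    weights = residue_weights()
    group_of = {r: g for g, residues in GROUP_RESIDUES.items() for r in residues}
    x = y = 0.0
    points = [(x, y)]
    for r in protein:
        ux, uy = unit[group_of[r]]
        x += weights[r] * ux
        y += weights[r] * uy
        points.append((x, y))
    return points


curve = characteristic_curve("MADSG-")  # hypothetical toy protein
for i, (x, y) in enumerate(curve):
    print(i, round(x, 4), round(y, 4))
```

Plotting the returned points in order gives the characteristic curve for one reading frame; repeating the call for the other two frames gives the other two curves.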

Property 1. For a given DNA sequence there are three 2D 
representations corresponding to it. 

Proof. Using the translate tool, one can obtain three protein sequences consisting of 20 amino acids and a stop code, corresponding to the three reading frames starting at positions 1, 2 and 3. In a single reading frame, let $(x_i, y_i)$ be the coordinates of the $i$th amino acid of the protein sequence; then we have



Fig. 2. SARS corresponding curve with different parameters n and m based on the first reading frame.

Fig. 3. SARS corresponding curve with different parameters n and m based on the second reading frame.

Fig. 4. SARS corresponding curve with different parameters n and m based on the third reading frame.

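$$
h_i (m, -\sqrt{n}) + g_i (\sqrt{n}, -m) + c_i (\sqrt{n}, m) + p_i (m, \sqrt{n}) = (x_i, y_i)
$$

i.e.

$$
\begin{cases}
h_i m + g_i \sqrt{n} + c_i \sqrt{n} + p_i m = x_i \\
-h_i \sqrt{n} - g_i m + c_i m + p_i \sqrt{n} = y_i
\end{cases}
\tag{3}
$$

□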


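Obviously, $x_i$ and $y_i$ are irrational numbers of the form $sm + k\sqrt{n}$, where $s$ and $k$ are integers. We suppose

$$
\begin{cases}
x_i = s_x m + k_x \sqrt{n} \\
y_i = s_y m + k_y \sqrt{n}
\end{cases}
$$

then we have

$$
\begin{cases}
h_i + p_i = s_x \\
g_i + c_i = k_x \\
-g_i + c_i = s_y \\
-h_i + p_i = k_y
\end{cases}
\tag{4}
$$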


So, for a given x-projection and y-projection of any point $P = (x, y)$ on the sequence, after uniquely determining $s_x, k_x, s_y, k_y$ from $x$ and $y$, the numbers $A_P, V_P, F_P, P_P, M_P, I_P, L_P, D_P, E_P, K_P, R_P, S_P, T_P, Y_P, H_P, C_P, N_P, Q_P, W_P, G_P, \Omega_P$ of A, V, F, P, M, I, L, D, E, K, R, S, T, Y, H, C, N, Q, W, G and - (or stop code) from the beginning of the sequence to the point $P$ can be found by solving the linear systems (2) and (4).
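
The first half of this inversion, solving system (4) for the group sums, is simple enough to write out. The sketch below is a minimal illustration with hypothetical function names; it stops at $h$, $g$, $c$, $p$ and does not attempt the second step of separating the individual residue counts through Eq. (2).

```python
def projections(h, g, c, p):
    """Forward direction of system (4): group sums -> (s_x, k_x, s_y, k_y)."""
    return h + p, g + c, -g + c, -h + p


def group_sums(s_x, k_x, s_y, k_y):
    """Solve system (4) for the group sums (h, g, c, p)."""
    h = (s_x - k_y) / 2
    p = (s_x + k_y) / 2
    g = (k_x - s_y) / 2
    c = (k_x + s_y) / 2
    return h, g, c, p


# Round trip on arbitrary illustrative values.
coeffs = projections(3, 1, 2, 4)
print(coeffs)
print(group_sums(*coeffs))
```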

The vector pointing to the point $P_i$ from the origin $O$ is denoted by $\mathbf{r}_i$. The components of $\mathbf{r}_i$, i.e. $x_i$ and $y_i$, are calculated by Eqs. (1) and (2). Let $\Delta\mathbf{r}_i = \mathbf{r}_i - \mathbf{r}_{i-1}$; then we have Property 2.

Property 2. For any $i = 1, 2, \ldots, N'$, where $N'$ is the length of the protein sequence corresponding to the studied DNA sequence, the vector $\Delta\mathbf{r}_i$ has only twenty-one possible directions. Furthermore, the length of $\Delta\mathbf{r}_i$, i.e. $|\Delta\mathbf{r}_i|$, is always equal to $\sqrt{s_k (m^2 + n)}$, for any $i = 1, 2, \ldots, N'$, $k = 0, 1, \ldots, 17$, $s_0 = 1$.

Proof. Actually, the components of $\Delta\mathbf{r}_i$, i.e. $\Delta x_i$ and $\Delta y_i$, can be calculated for each possible residue (A, V, F, P, M, I, L, D, E, K, R, S, T, Y, H, C, N, Q, W, G and -) at the $i$th position of the protein sequence by using Eqs. (1) and (2). For example, when the $i$th residue is A, we find $\Delta x_i = m$ and $\Delta y_i = -\sqrt{n}$. This result is independent of the $(i-1)$th residue. The two numbers $(m, -\sqrt{n})$ are called the direction of $\Delta\mathbf{r}_i$. The direction numbers and the length of $\Delta\mathbf{r}_i$ for each possible residue type at the $i$th position are summarized in Table 2. □
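
Property 2 can be checked numerically by listing the step vector of every residue type and comparing its length with $\sqrt{s_k (m^2 + n)}$. The sketch below assumes $s_k = (k+1)/(k+2)$ with $s_0 = 1$ for the first residue of each group, and illustrative parameters $m = 1$, $n = 2$ that are not taken from the paper; `step_table` is a hypothetical name.

```python
import math


def step_table(m=1.0, n=2.0):
    """Return {residue: (dx, dy, s)} for the 21 residue types."""
    rn = math.sqrt(n)
    groups = [
        ("AVFPMIL", (m, -rn)),   # hydrophobic
        ("DEKR", (rn, m)),       # charged
        ("STYHCNQW", (rn, -m)),  # polar
        ("G-", (m, rn)),         # glycine and stop code
    ]
    table, k = {}, 0
    for residues, (ux, uy) in groups:
        for j, r in enumerate(residues):
            if j == 0:
                s = 1.0  # s_0 = 1 for the first residue of a group
            else:
                k += 1
                s = (k + 1) / (k + 2)
            w = math.sqrt(s)
            table[r] = (w * ux, w * uy, s)
    return table


table = step_table()
for residue, (dx, dy, s) in table.items():
    length = math.hypot(dx, dy)
    # |delta r| should equal sqrt(s_k * (m^2 + n)) with m = 1, n = 2.
    assert math.isclose(length, math.sqrt(s * 3.0))
    print(residue, round(dx, 4), round(dy, 4), round(length, 4))
```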


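Property 3. There is no circuit or degeneracy in our two-dimensional graphical representation.

Proof. We assume that: (1) the number of amino acids forming a circuit is $\ell$; (2) the numbers of A, V, F, P, M, I, L, D, E, K, R, S, T, Y, H, C, N, Q, W, G and - (or stop code) in a circuit are $a', v', f', p', m', i', l', d', e', k', r', s', t', y', h', c', n', q', w', g'$ and $\omega'$, respectively. So $a' + v' + f' + p' + m' + i' + l' + d' + e' + k' + r' + s' + t' + y' + h' + c' + n' + q' + w' + g' + \omega' = \ell$. Because $a'$ A, $v'$ V, $f'$ F, $p'$ P, $m'$ M, $i'$ I, $l'$ L, $d'$ D, $e'$ E, $k'$ K, $r'$ R, $s'$ S, $t'$ T, $y'$ Y, $h'$ H, $c'$ C, $n'$ N, $q'$ Q, $w'$ W, $g'$ G and $\omega'$ - (or stop code) form a circuit, the following equation holds, where $\bar{h}$, $\bar{c}$, $\bar{g}$ and $\bar{p}$ denote the group sums over the circuit:

$$
\begin{cases}
\bar{h} = a' + \sqrt{s_1} v' + \sqrt{s_2} f' + \sqrt{s_3} p' + \sqrt{s_4} m' + \sqrt{s_5} i' + \sqrt{s_6} l' \\
\bar{c} = d' + \sqrt{s_7} e' + \sqrt{s_8} k' + \sqrt{s_9} r' \\
\bar{g} = s' + \sqrt{s_{10}} t' + \sqrt{s_{11}} y' + \sqrt{s_{12}} h' + \sqrt{s_{13}} c' + \sqrt{s_{14}} n' + \sqrt{s_{15}} q' + \sqrt{s_{16}} w' \\
\bar{p} = g' + \sqrt{s_{17}} \omega' \\
\bar{h} (m, -\sqrt{n}) + \bar{g} (\sqrt{n}, -m) + \bar{c} (\sqrt{n}, m) + \bar{p} (m, \sqrt{n}) = (0, 0)
\end{cases}
\tag{5}
$$

i.e.

$$
\begin{cases}
\bar{h} m + \bar{g} \sqrt{n} + \bar{c} \sqrt{n} + \bar{p} m = 0 \\
-\bar{h} \sqrt{n} - \bar{g} m + \bar{c} m + \bar{p} \sqrt{n} = 0
\end{cases}
\tag{6}
$$

Clearly Eqs. (5) and (6) hold if, and only if, $a' = v' = f' = p' = m' = i' = l' = d' = e' = k' = r' = s' = t' = y' = h' = c' = n' = q' = w' = g' = \omega' = 0$. Therefore, $\ell = 0$, which means no circuit exists in this graphical representation. □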


Property 4. The 2D representation possesses reflection symmetry.

Proof. Usually the sequence is expressed in the order from 5′ to 3′. Suppose that the 2D representation of the protein sequence is described by $(x_i, y_i)$, $i = 0, 1, 2, \ldots, N$. Suppose again that the 2D representation of the reverse sequence, i.e. the same sequence but from 3′ to 5′, is described by $(\bar{x}_i, \bar{y}_i)$. We find

$$
\begin{cases}
\bar{x}_i = x_N - x_{N-i} \\
\bar{y}_i = y_N - y_{N-i}
\end{cases}
$$

□
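
The reflection relation can be checked numerically on a toy sequence. The sketch below rebuilds the curve from the weighted unit vectors (assuming $s_k = (k+1)/(k+2)$ and illustrative parameters $m = 1$, $n = 2$, which are not taken from the paper) and compares the curve of the reversed protein string with $x_N - x_{N-i}$ and $y_N - y_{N-i}$. Here "reverse" means reading the same residue string backwards; the names are hypothetical.

```python
import math


def curve(protein, m=1.0, n=2.0):
    """Points (x_i, y_i), i = 0..len(protein), of the characteristic curve."""
    rn = math.sqrt(n)
    groups = [("AVFPMIL", (m, -rn)), ("DEKR", (rn, m)),
              ("STYHCNQW", (rn, -m)), ("G-", (m, rn))]
    step, k = {}, 0
    for residues, (ux, uy) in groups:
        for j, r in enumerate(residues):
            if j == 0:
                w = 1.0  # first residue of a group has weight 1
            else:
                k += 1
                w = math.sqrt((k + 1) / (k + 2))
            step[r] = (w * ux, w * uy)
    pts = [(0.0, 0.0)]
    for r in protein:
        pts.append((pts[-1][0] + step[r][0], pts[-1][1] + step[r][1]))
    return pts


protein = "MADSGKW-"  # hypothetical toy protein
fwd = curve(protein)
rev = curve(protein[::-1])
N = len(protein)
for i in range(N + 1):
    assert math.isclose(rev[i][0], fwd[N][0] - fwd[N - i][0], abs_tol=1e-12)
    assert math.isclose(rev[i][1], fwd[N][1] - fwd[N - i][1], abs_tol=1e-12)
print("reflection relation holds for", protein)
```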



3. Phylogenetic tree of coronaviruses 


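For any DNA sequence, we have three translated protein sequences. For any protein sequence, we have a set of points $(x_i, y_i)$, $i = 1, 2, 3, \ldots, N$, where $N$ is the length of the sequence. The coordinates of the geometrical center of the points, denoted by $x^0$ and $y^0$, may be calculated as follows:

$$
x^0 = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad y^0 = \frac{1}{N} \sum_{i=1}^{N} y_i
$$

The elements of the covariance matrix CM of the points are defined as:

$$
\begin{cases}
CM_{xx} = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - x^0)(x_i - x^0) \\
CM_{xy} = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - x^0)(y_i - y^0) \\
CM_{yx} = \dfrac{1}{N} \sum_{i=1}^{N} (y_i - y^0)(x_i - x^0) \\
CM_{yy} = \dfrac{1}{N} \sum_{i=1}^{N} (y_i - y^0)(y_i - y^0)
\end{cases}
$$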



The above four numbers give a quantitative description of a set of points $(x_i, y_i)$, $i = 1, 2, \ldots, N$, scattered in a two-dimensional space. Obviously, the matrix is a real symmetric $2 \times 2$ one. Each matrix CM has a leading eigenvalue, so there are three geometrical centers and three leading eigenvalues corresponding to a DNA sequence. For the 24 species of Table 1, Table 3 lists the geometrical centers $(x_k^0, y_k^0)$, $k = 1, 2, 3$, and the leading eigenvalues $\lambda_k$, computed for a fixed choice of $m$ and $n$ with $s_1 = 2/3$; $s_2 = 3/4$; $s_3 = 4/5$; $s_4 = 5/6$; $s_5 = 6/7$; $s_6 = 7/8$; $s_7 = 8/9$; $s_8 = 9/10$; $s_9 = 10/11$; $s_{10} = 11/12$; $s_{11} = 12/13$; $s_{12} = 13/14$; $s_{13} = 14/15$; $s_{14} = 15/16$; $s_{15} = 16/17$; $s_{16} = 17/18$; $s_{17} = 18/19$.
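
A minimal sketch of the geometric center, covariance matrix and leading eigenvalue for one set of points follows. It assumes the $1/N$ normalization and uses the closed-form largest eigenvalue of a real symmetric $2 \times 2$ matrix; the function names and the four sample points are hypothetical and do not come from the coronavirus data.

```python
import math


def geometric_center(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def covariance_matrix(points):
    """Return ((CM_xx, CM_xy), (CM_yx, CM_yy)) with 1/N normalization."""
    n = len(points)
    x0, y0 = geometric_center(points)
    cxx = sum((x - x0) ** 2 for x, _ in points) / n
    cxy = sum((x - x0) * (y - y0) for x, y in points) / n
    cyy = sum((y - y0) ** 2 for _, y in points) / n
    return ((cxx, cxy), (cxy, cyy))


def leading_eigenvalue(cm):
    """Largest eigenvalue of a real symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = cm
    return (a + d) / 2 + math.hypot((a - d) / 2, b)


points = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]  # toy data
cm = covariance_matrix(points)
print(geometric_center(points), cm, leading_eigenvalue(cm))
```

Applying these three functions to the curve of each reading frame gives the three centers and three leading eigenvalues associated with one DNA sequence.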

In order to facilitate the quantitative comparison of different species in terms of their collective parameters, we introduce a distance scale as defined below. Suppose that there are two species $i$ and $j$ with parameters $\lambda_1^i, \lambda_2^i, \lambda_3^i$ and $\lambda_1^j, \lambda_2^j, \lambda_3^j$, respectively, where $\lambda_1^i, \lambda_2^i, \lambda_3^i$ are the three leading eigenvalues of the matrices $CM^i$ corresponding to species $i$. The distance $d_{ij}$ between the two points is

$$
d_{ij} = \sqrt{(\lambda_1^i - \lambda_1^j)^2 + (\lambda_2^i - \lambda_2^j)^2 + (\lambda_3^i - \lambda_3^j)^2}, \quad i, j = 1, 2, \ldots, M
\tag{10}
$$


Table 1
The accession number, abbreviation, name and length for the 24 coronavirus genomes

| No. | Accession | Abbreviation | Genome | Length (nt) |
|---|---|---|---|---|
| 1 | NC_002645 | HCoV_229E | Human coronavirus 229E | 27317 |
| 2 | NC_002306 | TGEV | Transmissible gastroenteritis virus | 28586 |
| 3 | NC_003436 | PEDV | Porcine epidemic diarrhea virus | 28033 |
| 4 | U00735 | BCoVM | Bovine coronavirus strain Mebus | 31032 |
| 5 | AF391542 | BCoVL | Bovine coronavirus isolate BCoV-LUN | 31028 |
| 6 | AF220295 | BCoVQ | Bovine coronavirus Quebec | 31100 |
| 7 | NC_003045 | BCoV | Bovine coronavirus | 31028 |
| 8 | AF208067 | MHVM | Murine hepatitis virus strain ML-10 | 31233 |
| 9 | AF101929 | MHV2 | Murine hepatitis virus strain 2 | 31276 |
| 10 | AF208066 | MHVP | Murine hepatitis virus strain Penn 97-1 | 31112 |
| 11 | NC_001846 | MHV | Murine hepatitis virus | 31357 |
| 12 | NC_001451 | IBV | Avian infectious bronchitis virus | 27608 |
| 13 | AY278488 | BJ01 | SARS coronavirus BJ01 | 29725 |
| 14 | AY278741 | Urbani | SARS coronavirus Urbani | 29727 |
| 15 | AY278491 | HKU-39849 | SARS coronavirus HKU-39849 | 29742 |
| 16 | AY278554 | CUHK-W1 | SARS coronavirus CUHK-W1 | 29736 |
| 17 | AY282752 | CUHK-Su10 | SARS coronavirus CUHK-Su10 | 29736 |
| 18 | AY283794 | SIN2500 | SARS coronavirus Sin2500 | 29711 |
| 19 | AY283795 | SIN2677 | SARS coronavirus Sin2677 | 29705 |
| 20 | AY283796 | SIN2679 | SARS coronavirus Sin2679 | 29711 |
| 21 | AY283797 | SIN2748 | SARS coronavirus Sin2748 | 29706 |
| 22 | AY283798 | SIN2774 | SARS coronavirus Sin2774 | 29711 |
| 23 | AY291451 | TW1 | SARS coronavirus TW1 | 29729 |
| 24 | NC_004718 | TOR2 | SARS coronavirus | 29751 |

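Table 2
Twenty-one possible directions

| Residue | $\Delta x_i$ | $\Delta y_i$ | $\lvert\Delta\mathbf{r}_i\rvert$ |
|---|---|---|---|
| A | $m$ | $-\sqrt{n}$ | $\sqrt{m^2 + n}$ |
| D | $\sqrt{n}$ | $m$ | $\sqrt{m^2 + n}$ |
| S | $\sqrt{n}$ | $-m$ | $\sqrt{m^2 + n}$ |
| G | $m$ | $\sqrt{n}$ | $\sqrt{m^2 + n}$ |
| V | $m\sqrt{s_1}$ | $-\sqrt{n s_1}$ | $\sqrt{s_1 (m^2 + n)}$ |
| F | $m\sqrt{s_2}$ | $-\sqrt{n s_2}$ | $\sqrt{s_2 (m^2 + n)}$ |
| P | $m\sqrt{s_3}$ | $-\sqrt{n s_3}$ | $\sqrt{s_3 (m^2 + n)}$ |
| M | $m\sqrt{s_4}$ | $-\sqrt{n s_4}$ | $\sqrt{s_4 (m^2 + n)}$ |
| I | $m\sqrt{s_5}$ | $-\sqrt{n s_5}$ | $\sqrt{s_5 (m^2 + n)}$ |
| L | $m\sqrt{s_6}$ | $-\sqrt{n s_6}$ | $\sqrt{s_6 (m^2 + n)}$ |
| E | $\sqrt{n s_7}$ | $m\sqrt{s_7}$ | $\sqrt{s_7 (m^2 + n)}$ |
| K | $\sqrt{n s_8}$ | $m\sqrt{s_8}$ | $\sqrt{s_8 (m^2 + n)}$ |
| R | $\sqrt{n s_9}$ | $m\sqrt{s_9}$ | $\sqrt{s_9 (m^2 + n)}$ |
| T | $\sqrt{n s_{10}}$ | $-m\sqrt{s_{10}}$ | $\sqrt{s_{10} (m^2 + n)}$ |
| Y | $\sqrt{n s_{11}}$ | $-m\sqrt{s_{11}}$ | $\sqrt{s_{11} (m^2 + n)}$ |
| H | $\sqrt{n s_{12}}$ | $-m\sqrt{s_{12}}$ | $\sqrt{s_{12} (m^2 + n)}$ |
| C | $\sqrt{n s_{13}}$ | $-m\sqrt{s_{13}}$ | $\sqrt{s_{13} (m^2 + n)}$ |
| N | $\sqrt{n s_{14}}$ | $-m\sqrt{s_{14}}$ | $\sqrt{s_{14} (m^2 + n)}$ |
| Q | $\sqrt{n s_{15}}$ | $-m\sqrt{s_{15}}$ | $\sqrt{s_{15} (m^2 + n)}$ |
| W | $\sqrt{n s_{16}}$ | $-m\sqrt{s_{16}}$ | $\sqrt{s_{16} (m^2 + n)}$ |
| - | $m\sqrt{s_{17}}$ | $\sqrt{n s_{17}}$ | $\sqrt{s_{17} (m^2 + n)}$ |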


where $d_{ij}$ denotes the distance between the $i$th and the $j$th genomes, and $M$ is the total number of genomes ($M = 24$ here). We thus obtain a real symmetric $M \times M$ matrix $D = (d_{ij})$, which is used to reflect the evolutionary distance between the species $i$ and $j$. The clustering tree is constructed using the UPGMA method in the Phylip package (http://evolution.genetics.washington.edu/phylip.html). The final phylogenetic tree is drawn using the Drawgram program in the Phylip package. In Fig. 5, we present the phylogenetic tree of the 24 species.

Fig. 5. Phylogenetic tree.
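
To make the last two steps concrete, the sketch below builds the distance matrix of Eq. (10) for five rows of Table 3 and clusters them with a small UPGMA routine. It is an illustrative pure-Python stand-in, not the Phylip implementation used for Fig. 5; the function names are hypothetical, and the keys are the row indices of Table 3.

```python
import math
from itertools import combinations

# Leading-eigenvalue triples copied from rows 1, 2, 3, 14 and 15 of Table 3.
EIGEN = {
    1: (2.1520, 2.2321, 2.3707),
    2: (2.6999, 2.8393, 2.9157),
    3: (2.7034, 2.8386, 2.9178),
    14: (2.4675, 2.5711, 2.6821),
    15: (2.4680, 2.5710, 2.6828),
}


def eigen_distance(a, b):
    """Euclidean distance of Eq. (10) between two eigenvalue triples."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def upgma(labels, dist):
    """Cluster by average linkage; dist is keyed by frozenset({a, b}).

    Returns the tree as nested tuples (branch lengths are not computed).
    """
    sizes = {label: 1 for label in labels}  # cluster -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        for c in sizes:
            # Size-weighted average of the distances to the two merged clusters.
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


dist = {
    frozenset((i, j)): eigen_distance(EIGEN[i], EIGEN[j])
    for i, j in combinations(EIGEN, 2)
}
tree = upgma(list(EIGEN), dist)
print(tree)
```

With all 24 rows of Table 3 as input, the same two functions would produce the full distance matrix and a tree topology that could be compared with Fig. 5.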

4. Conclusion 

We have analyzed DNA sequences by considering the fully overlapping triplets of nucleotide bases. The presented graphical representation can be recaptured mathematically without loss of textual information, and it provides a direct plotting method to denote DNA sequences without degeneracy.

Table 3
The geometric centers and three leading eigenvalues for each of the 24 coronavirus genomes

| $i$ | $x_1^0$ | $y_1^0$ | $x_2^0$ | $y_2^0$ | $x_3^0$ | $y_3^0$ | $\lambda_1$ | $\lambda_2$ | $\lambda_3$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.5692e+003 | -159.0439 | 2.5566e+003 | -342.5873 | 2.6794e+003 | 389.8249 | 2.1520 | 2.2321 | 2.3707 |
| 2 | 2.8619e+003 | -230.4309 | 2.8245e+003 | -723.2605 | 2.9971e+003 | 128.9913 | 2.6999 | 2.8393 | 2.9157 |
| 3 | 2.8626e+003 | -233.0932 | 2.8231e+003 | -724.5553 | 2.9976e+003 | 130.5104 | 2.7034 | 2.8386 | 2.9178 |
| 4 | 2.8602e+003 | -245.6989 | 2.8245e+003 | -743.4898 | 2.9985e+003 | 133.2708 | 2.7056 | 2.8453 | 2.9209 |
| 5 | 2.8688e+003 | -294.6379 | 2.8364e+003 | -709.6245 | 3.0012e+003 | 146.3851 | 2.7519 | 2.8561 | 2.9158 |
| 6 | 2.6263e+003 | 415.1362 | 2.5204e+003 | -204.5027 | 2.4666e+003 | -516.9428 | 2.2817 | 2.0813 | 2.1269 |
| 7 | 2.8773e+003 | -476.9658 | 2.8773e+003 | -476.9658 | 2.9006e+003 | -252.7994 | 2.8910 | 2.8910 | 2.7932 |
| 8 | 2.8902e+003 | -446.8927 | 2.8902e+003 | -446.8927 | 2.9139e+003 | -227.7537 | 2.9004 | 2.9004 | 2.8179 |
| 9 | 2.8853e+003 | -459.6862 | 3.0344e+003 | 82.5446 | 2.8912e+003 | -273.7115 | 2.9146 | 2.9739 | 2.7829 |
| 10 | 2.8582e+003 | -528.7428 | 3.0320e+003 | 34.9426 | 2.8807e+003 | -253.2886 | 2.8697 | 2.9882 | 2.7408 |
| 11 | 2.5137e+003 | -415.8854 | 2.6893e+003 | 244.2464 | 2.5817e+003 | -222.8666 | 2.221 | 2.3287 | 2.1831 |
| 12 | 2.7670e+003 | -48.3996 | 2.7276e+003 | -34.7759 | 2.8570e+003 | 524.7574 | 2.4705 | 2.5740 | 2.6849 |
| 13 | 2.7255e+003 | -35.7080 | 2.8550e+003 | 526.4976 | 2.7646e+003 | -43.8066 | 2.5698 | 2.6804 | 2.4654 |
| 14 | 2.7656e+003 | -45.9837 | 2.7262e+003 | -35.1151 | 2.8557e+003 | 528.0186 | 2.4675 | 2.5711 | 2.6821 |
| 15 | 2.7659e+003 | -45.2775 | 2.7260e+003 | -36.4889 | 2.8558e+003 | 530.0127 | 2.4680 | 2.5710 | 2.6828 |
| 16 | 2.7656e+003 | -47.8004 | 2.7267e+003 | -33.6628 | 2.8560e+003 | 527.4290 | 2.4680 | 2.5725 | 2.6838 |
| 17 | 2.7239e+003 | -35.1426 | 2.8535e+003 | 527.3351 | 2.7632e+003 | -45.2702 | 2.5669 | 2.6777 | 2.4630 |
| 18 | 2.7233e+003 | -36.1921 | 2.8529e+003 | 527.2583 | 2.7627e+003 | -45.4289 | 2.5657 | 2.6766 | 2.4620 |
| 19 | 2.7239e+003 | -34.4434 | 2.8535e+003 | 527.8162 | 2.7633e+003 | -45.2775 | 2.5667 | 2.6780 | 2.4630 |
| 20 | 2.7239e+003 | -35.6707 | 2.8525e+003 | 525.5247 | 2.7621e+003 | -43.2715 | 2.5678 | 2.6737 | 2.4587 |
| 21 | 2.7241e+003 | -35.5425 | 2.8535e+003 | 527.2287 | 2.7634e+003 | -45.5734 | 2.5675 | 2.6777 | 2.4636 |
| 22 | 2.7647e+003 | -48.0684 | 2.7258e+003 | -35.7184 | 2.8553e+003 | 523.7099 | 2.4661 | 2.5700 | 2.6815 |
| 23 | 2.7647e+003 | -47.8421 | 2.7252e+003 | -35.8263 | 2.8547e+003 | 524.8910 | 2.4661 | 2.5692 | 2.6808 |
| 24 | 2.6110e+003 | -251.1068 | 2.7585e+003 | 459.3175 | 2.6727e+003 | -97.0235 | 2.3573 | 2.4587 | 2.3322 |

Most existing approaches for phylogenetic inference use multiple alignment of sequences and assume some sort of evolutionary model. The multiple alignment strategy does not work for all types of data, e.g., whole genome phylogeny, and the evolutionary models may not always be correct. The current two-dimensional graphical representation of DNA sequences provides a different approach for constructing phylogenetic trees. Unlike most existing phylogeny construction methods, the proposed method does not require multiple alignment. Also, both computational scientists and molecular biologists can use it to analyze protein sequences efficiently. We can obtain further graphical representations of protein sequences based on 2D, 3D and 4D using the following transform: $a_i \to h_i$, $g_i \to g_i$, $c_i \to c_i$, $t_i \to p_i$, where $h_i$, $c_i$, $g_i$ and $p_i$ satisfy Eq. (2), and $a_i$, $c_i$, $g_i$ and $t_i$ are the cumulative occurrence numbers of A, C, G and T, respectively, in the subsequence from the 1st base to the $i$th base in the sequence.